[1]:
%load_ext autoreload
%autoreload 2

Google Analytics Analysis + Pipeline Example

In this notebook, we will be analyzing the open bigquery google analytics dataset.

Objectives: - Demonstrate capability to process complex data sources - Demonstrate capability to convert the analysis into a pipeline which can be reran over time with little effort.

Note on Privacy & Security

Privacy of your data is of primte importance. This library has been specifically designed to NOT share any part of your data with the Genie APIs. Just the metadata about your data like name and types of columns of a pandas dataframe would be shared, which help in generating high quality results.

setup

access token

Don’t have an access token yet? Sign up for free!

Store your access token in a .env file with the same format as shown in the .env.example file

[2]:
from dotenv import load_dotenv
[3]:
load_dotenv(".env")  # replace with path to your env file
[3]:
True

We’re using dotenv library here, alternatively you can simply set the environment variable as:

import os
os.environ["CODE_GENIE_TOKEN"] = "your-access-token-here"

Though it is not recommended to do so as notebooks are typically shared with others and you should keep your access token private.

genie cache

By default the package will cache genie invocations in a temp file. This would mean that any genies created will be lost once you restart the kernel or rerun the notebook at another time. to keep the cached genies, you can set a custom path where genies would be stored so that you will not lose them when you rerun the notebook.

You can modify the cached code if you’d like and those modifications would be loaded the next time you run the code.

[4]:
CACHE_DIR = "./_cache_google_analytics"

Analysis objective

We will use the google analytics dataset and create a pipeline which can be run once every week and generate stats for the given week. We will create 2 tables containing following information:

  1. Page level bounce and exit rate stats

  2. User funnel stats

Load Data

Lets load 1 week of data using bigquery from July 1 - July 7 2017

[5]:
start_date = "20170701"
end_date = "20170707"
[6]:
from google.cloud import bigquery
client = bigquery.Client()
[7]:
query = f"""
select * from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
where _TABLE_SUFFIX between '{start_date}' and '{end_date}'
""".format(start_date=start_date, end_date=end_date)
df = client.query(query).result().to_dataframe()
df.shape
[7]:
(15920, 16)
[8]:
df.head(2)
[8]:
visitorId visitNumber visitId visitStartTime date totals trafficSource device geoNetwork customDimensions hits fullVisitorId userId clientId channelGrouping socialEngagementType
0 <NA> 1 1499117293 1499117293 20170703 {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... {'referralPath': None, 'campaign': '(not set)'... {'browser': 'Safari', 'browserVersion': 'not a... {'continent': 'Americas', 'subContinent': 'Cen... [{'index': 4, 'value': 'Central America'}] [{'hitNumber': 1, 'time': 0, 'hour': 14, 'minu... 1018154947831642966 None None Organic Search Not Socially Engaged
1 <NA> 1 1499132474 1499132474 20170703 {'visits': 1, 'hits': 16, 'pageviews': 10, 'ti... {'referralPath': None, 'campaign': '(not set)'... {'browser': 'Chrome', 'browserVersion': 'not a... {'continent': 'Americas', 'subContinent': 'Nor... [{'index': 4, 'value': 'North America'}] [{'hitNumber': 1, 'time': 0, 'hour': 18, 'minu... 1770622829105680991 None None Organic Search Not Socially Engaged

Extract information

In order to perform the given analysis, we need to extract the following nested fields into their own columns: - hits.page.pagePath: landing_page - totals.bounces: bounces - hits.type: hit_type - hits.hitNumber: hit_number - hits.eCommerceAction.action_type: action_type

[9]:
from code_genie import Genie
[10]:
genie = Genie(data=df, cache_dir=CACHE_DIR)

The hits column contains multiple values in a single row, we need to convert that into separate rows

expand hits into separate rows

Note that update_base_input=True means that the data in the genie is update and next genie invocations will use this data as input.

[11]:
gr_hits_expanded = genie.plz("""
each value in hits column contains a list. create a new dataframe where a separate row exists for each value.
copy the totals columns to the new df, keep the same value for each item in the list. drop all other columns except hits and totals.
""", update_base_input=True)
Loading cached genie id: separate_hits_70651, set override = True to rerun
[12]:
gr_hits_expanded.result.head(2)
[12]:
hits totals
0 {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut... {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...
0 {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm... {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti...

extract page path column

[13]:
gr_page_path = genie.plz("""
each value of hits is a dictionary with page as a key. the value of page is also a dictionary with pagePath as a key.
extract pagePath into a new column called page_path
""", update_base_input=True)
Loading cached genie id: extract_page_path_78866, set override = True to rerun
[14]:
gr_page_path.result.head(2)
[14]:
hits totals page_path
0 {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut... {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... /home
0 {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm... {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... /home

extract hit_type

[15]:
gr_hit_type = genie.custom("""
def run(df):
    df["hit_type"] = df["hits"].apply(lambda x: x["type"])
    return df
""", update_base_input=True)
Genie cached with id: run_51794

extract hit_number

[16]:
gr_hit_number = genie.plz("""
each value of hits is a dictionary with hitNumber as a key. extract type into a new column called hit_number
""", update_base_input=True)
Loading cached genie id: extract_hit_number_77980, set override = True to rerun

Let’s see another example of using a custom genie

extract action_type

[17]:
gr_action_type = genie.custom("""
def run(df):
    df["action_type"] = df["hits"].apply(lambda x: int(x["eCommerceAction"]["action_type"]))
    return df
""", update_base_input=True)
Genie cached with id: run_20093

extract bounce

[18]:
gr_bounce = genie.plz("""
each value of totals is a dictionary. create a new boolean column called bounces which is True if the key bounces in totals has a value of 1
""", update_base_input=True)
Loading cached genie id: add_bounces_column_93508, set override = True to rerun
[19]:
gr_bounce.result.head(2)
[19]:
hits totals page_path hit_type hit_number action_type bounces
0 {'hitNumber': 1, 'time': 0, 'hour': 14, 'minut... {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... /home PAGE 1 0 False
0 {'hitNumber': 2, 'time': 56142, 'hour': 14, 'm... {'visits': 1, 'hits': 16, 'pageviews': 12, 'ti... /home PAGE 2 0 False

Aggregate data

landing page bounce rate

[20]:
gr_landing_page = genie.plz([
    "filter data for hit_number as 1 and hit_type as PAGE",
    """group data by page_path, create 2 aggregate columns:
views: total number of rows in the group
bounce_rate: fraction of rows in the group where bounces column is True
""",
    "sort the data by views descending"])
Loading cached genie id: get_page_bounce_rate_81411, set override = True to rerun
[21]:
gr_landing_page.result.head(5)
[21]:
views bounce_rate
page_path
/home 7698 0.489738
/google+redesign/shop+by+brand/youtube 3911 0.629507
/google+redesign/shop+by+brand/waze+baby+on+board+window+decal.axd 681 0.654919
/google+redesign/apparel/mens/mens+t+shirts 585 0.482051
/signin.html 225 0.342222

exit rates of pages

[22]:
gr_exit_rate = genie.plz([
    "filter data for hit_type as PAGE",
    """group data by page_path, create 2 aggregate columns:
views: total number of rows in the group
exit_rate: fraction of rows in the group where bounces column is True
""",
    "sort the data by views descending"])
Loading cached genie id: aggregate_data_92888, set override = True to rerun
[23]:
gr_exit_rate.result.head(5)
[23]:
views exit_rate
page_path
/home 13174 0.287308
/google+redesign/shop+by+brand/youtube 4988 0.493585
/basket.html 2426 0.027205
/signin.html 1512 0.050926
/google+redesign/apparel/mens/mens+t+shirts 1468 0.192098

user funnel

[24]:
gr_funnel = genie.plz([
    "group data by action_type and count number of rows in each group",
    "remove action_type other than 1, 2, 5, 6",
    """replace the action_type values as:
    1: Click on product list page
    2: Product details page
    5: Checkout
    6: Purchase Complete
    """])
Loading cached genie id: group_action_type_27490, set override = True to rerun
[25]:
gr_funnel.result
[25]:
action_type count
1 Click on product list page 6887
2 Product details page 5386
5 Checkout 1115
6 Purchase Complete 292

Convert into a pipeline

Now that we have done all the work in terms of analyzing data and creating what we want, we can put this into a pipeline. Links to API documentation: - Pipeline - Arguments - Sources - Sinks

Define a source

The first step in a pipeline is to define a source of external data which will feed the pipeline. We need to use one of the pre-defined data sources, let’s use the BQ data source here.

[26]:
from code_genie.io import BigQueryToDataframeSource, StringArg
[27]:
# difference from how we used above is that start_date and end_date need to come from inputs to the pipeline
query = f"""
select * from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
where _TABLE_SUFFIX between '{start_date}' and '{end_date}'
"""
source = BigQueryToDataframeSource(
    query=query,
    query_args={"start_date": StringArg(name="analysis-start-date"),  # need to be set during pipeline run
                "end_date": StringArg(name="analysis-end-date")},  # need to be set during pipeline run
    key_path=StringArg(env_var="GOOGLE_APPLICATION_CREDENTIALS")  # to be read from env var
)

Define sinks

Now that we have a source, we need to define where to export the results. In this example, let’s export the 3 dataframes we created as csv files into the cache dir using the DataFrameToCsvSink

[28]:
from code_genie.io import DataFrameToCsvSink
[29]:
sink_bounce_rate = DataFrameToCsvSink(path=StringArg(name="bounce-rate-export-path"))
sink_exit_rate = DataFrameToCsvSink(path=StringArg(name="exit-rate-export-path"))
sink_funnel = DataFrameToCsvSink(path=StringArg(name="funnel-export-path"))

Define pipeline steps

Once we have our source and sinks, we can stich them together with the genies we have already create to make a pipeline. Read the pipeline docs for more info.

[30]:
from code_genie.pipeline import PipelineStep, GeniePipeline
[31]:
# initialize pipeline step
pipeline_steps = []

Use source data and expand hits

[32]:
pipeline_steps.append(PipelineStep(genie_result=gr_hits_expanded, data=source))
len(pipeline_steps)
[32]:
1

Add all extraction steps

Note that here the input source is the previous genie as we want to carry forward all the columns added to the very end

[33]:
pipeline_steps.extend([
    # extract page path
    PipelineStep(genie_result=gr_page_path,
                 data=gr_hits_expanded),
    # extract hit type
    PipelineStep(genie_result=gr_hit_type,
                 data=gr_page_path),
    # extract hit number
    PipelineStep(genie_result=gr_hit_number,
                 data=gr_hit_type),
    # extract action type
    PipelineStep(genie_result=gr_action_type,
                 data=gr_hit_number),
    # extract bounce
    PipelineStep(genie_result=gr_bounce,
                 data=gr_action_type)
])
len(pipeline_steps)
[33]:
6

Export results to sinks

Now that we have interim layers added, we can add the final aggregation layer and assign the corresponding sinks so that the output data is generated. Note that the base_input_genie argument for all of these is still gr_bounce as the result of that genie needs to be fed into all of them.

We will also add the sink parameter to these steps so that the results are generated

[34]:
pipeline_steps.extend([
    PipelineStep(genie_result=gr_landing_page, data=gr_bounce, sink=sink_bounce_rate),
    PipelineStep(genie_result=gr_exit_rate, data=gr_bounce, sink=sink_exit_rate),
    PipelineStep(genie_result=gr_funnel, data=gr_bounce, sink=sink_funnel)
])
len(pipeline_steps)
[34]:
9

Create pipeline and export

[35]:
pipeline = GeniePipeline(name="google-analytics-pipeline", version="1", steps=pipeline_steps, cache_dir=CACHE_DIR)
[36]:
pipeline.export("pipeline-v1.json")

Run the pipeline

Now that we have created the pipeline, we can load it into a new session and run it. From this point, the code does not depend on the execution of any pieces of code before this as long as we have exported the pipeline. You can restart your kernel and run again from here on.

rerun setup

lets load our env vars and setup cache dir as before

[1]:
from dotenv import load_dotenv
load_dotenv(".env")  # replace with path to your env file
CACHE_DIR = "./_cache_google_analytics"

load the pipeline and run

Lets rerun for the same dates as above to see if we can get the same results back.

[2]:
from code_genie.pipeline import GeniePipeline
[3]:
pipeline = GeniePipeline.load(f"{CACHE_DIR}/pipeline-v1.json")
[4]:
# lets create a temporary directory to store the generated datasets
from tempfile import mkdtemp
export_dir = mkdtemp()
[5]:
pipeline_args={
    "analysis-start-date": "20170701",
    "analysis-end-date": "20170707",
    "bounce-rate-export-path": f"{export_dir}/bounce_rate.csv",
    "exit-rate-export-path": f"{export_dir}/exit_rate.csv",
    "funnel-export-path": f"{export_dir}/funnel_stats.csv"
}
pipeline.run(pipeline_args)
Running step 1: separate_hits_70651
        Completed in 13.2 seconds
Running step 2: extract_page_path_78866
        Completed in 0.0 seconds
Running step 3: run_75120
        Completed in 0.0 seconds
Running step 4: extract_hit_number_77980
        Completed in 0.0 seconds
Running step 5: run_83895
        Completed in 0.0 seconds
Running step 6: add_bounces_column_93508
        Completed in 0.0 seconds
Running step 7: get_page_bounce_rate_81411
        Completed in 0.0 seconds
Running step 8: aggregate_data_92888
        Completed in 0.0 seconds
Running step 9: group_action_type_27490
        Completed in 0.0 seconds
[6]:
# load results
import pandas as pd
pd.read_csv(pipeline_args["bounce-rate-export-path"]).head(5)
[6]:
page_path views bounce_rate
0 /home 7698 0.489738
1 /google+redesign/shop+by+brand/youtube 3911 0.629507
2 /google+redesign/shop+by+brand/waze+baby+on+bo... 681 0.654919
3 /google+redesign/apparel/mens/mens+t+shirts 585 0.482051
4 /signin.html 225 0.342222
[7]:
pd.read_csv(pipeline_args["exit-rate-export-path"]).head(5)
[7]:
page_path views exit_rate
0 /home 13174 0.287308
1 /google+redesign/shop+by+brand/youtube 4988 0.493585
2 /basket.html 2426 0.027205
3 /signin.html 1512 0.050926
4 /google+redesign/apparel/mens/mens+t+shirts 1468 0.192098
[8]:
pd.read_csv(pipeline_args["funnel-export-path"]).head(5)
[8]:
index action_type count
0 1 Click on product list page 6887
1 2 Product details page 5386
2 5 Checkout 1115
3 6 Purchase Complete 292